UNIGE Experiments on Robust Word Sense Disambiguation

Authors

  • Jacques Guyot
  • Gilles Falquet
  • Saïd Radhouani
  • Karim Benzineb
Abstract

This task was meant to compare the results of two different retrieval techniques: the first one was based on the words found in documents and query texts; the second one was based on the senses (concepts) obtained by disambiguating the words in documents and queries. The underlying goal was to gain more precise knowledge of the possible improvements brought by word sense disambiguation (WSD) to the information retrieval process. The proposed task structure was interesting in that it drew a clear separation between the actors (humans or computers): those who provide the corpus, those who disambiguate it, and those who query it. It was thus possible to test the universality and the interoperability of the methods and algorithms involved.

Training and Testing Data

The document corpus was created by merging two collections of English documents, LA Times 94 and Glasgow Herald 95 (166'000 documents with 470'000 unique words and 55 million word occurrences). This corpus was processed with two different word sense disambiguation algorithms, UBC [ubc07] and NUS [nus07], resulting in two different sets. The disambiguation process replaced each occurrence of a term (composed of one or more words) by an XML element containing the term identifier, an extracted lemma, a part-of-speech (POS) tag (noun, verb, adjective...), the original word form (WF) and a list of senses together with their respective scores. The senses were represented by WordNet 1.6 synset identifiers. For instance, an occurrence of the word "discovery" would be replaced by such an element listing its candidate synsets and their scores.

A training set of 150 queries (topics) was provided together with the expected results, as well as a testing set containing 160 queries. As usual, the queries included three parts: a title (T), a description (D) and a narrative (N). The English queries were processed with the UBC and NUS disambiguation algorithms, while the Spanish queries were disambiguated with the first-sense heuristic (FSH), i.e. by always choosing the first sense available.

Experiments

Indexer

To index the corpus, we chose the IDX-VLI indexer described in [gfb06] because it can gather a wealth of information (positions, etc.), it has built-in operators and it is remarkably fast. Still, we only used the basic version of that indexer, i.e. we did not use any relevance feedback mechanism, context description or any other sophisticated tool of that sort. We thus avoided interfering with the direct results of the experiment and we facilitated the result analysis.

Collection processing

We developed and tested several document processing strategies on the provided collections. Those strategies were applied to each element within each document (see the sketch below):
  • NAT: keep only the word form of each element (i.e. rebuild the original text)
  • LEM: keep only the lemma
  • POS: keep the lemma and the part-of-speech tag
  • WSD: keep only the synset with the best score (this amounts to considering that the disambiguation algorithm is "perfect"; alternatively, we could have added all the synsets with a score greater than a given threshold)
  • WSDL: keep the best synset and the lemma.
During the indexing process the strategies were applied to all the terms, including numbers, except for the stopwords. Given the poor performance of the POS approach, we quickly gave up this option.

Topic processing

The same processing strategies were applied to the queries, with an extended stop-word list including words such as report, find, etc. For each topic we derived three queries:
  • T: include only the title part
  • TD: include the title and description terms
  • TDN: include the title, description and narrative terms.
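As an illustration of the collection-processing strategies above, the following sketch applies each of them to one disambiguated term element. It is only a minimal sketch under assumed names: the element layout (TERM, LEMA, POS, WF, SYNSET, SCORE, CODE) and the sample synset codes are placeholders chosen for illustration, not necessarily the exact schema distributed with the collection.

```python
import xml.etree.ElementTree as ET

# Hypothetical disambiguated term element. The tag/attribute names and the
# synset codes are placeholders; the real collection schema may differ.
SAMPLE = """<TERM ID="LAT940101-001-42" LEMA="discovery" POS="NN">
  <WF>discovery</WF>
  <SYNSET SCORE="0.70" CODE="00000001-n"/>
  <SYNSET SCORE="0.30" CODE="00000002-n"/>
</TERM>"""

def index_tokens(term_xml, strategy):
    """Return the token(s) that would be indexed for one term under a strategy."""
    term = ET.fromstring(term_xml)
    lemma = term.get("LEMA")
    pos = term.get("POS")
    word_form = term.findtext("WF")
    synsets = term.findall("SYNSET")
    # WSD/WSDL keep only the best-scoring synset, i.e. the disambiguation
    # output is taken as "perfect"; a variant could instead keep all synsets
    # whose score exceeds a threshold.
    best = max(synsets, key=lambda s: float(s.get("SCORE", "0"))) if synsets else None

    if strategy == "NAT":   # rebuild the original text
        return [word_form]
    if strategy == "LEM":   # lemma only
        return [lemma]
    if strategy == "POS":   # lemma plus part-of-speech tag
        return [f"{lemma}#{pos}"]
    if strategy == "WSD":   # best synset only
        return [best.get("CODE")] if best is not None else []
    if strategy == "WSDL":  # best synset and the lemma
        return ([best.get("CODE")] if best is not None else []) + [lemma]
    raise ValueError(f"unknown strategy: {strategy}")

for s in ("NAT", "LEM", "POS", "WSD", "WSDL"):
    print(s, index_tokens(SAMPLE, s))
```

Under WSDL both the synset and the lemma are indexed for the same occurrence, which is the only configuration that slightly outperformed the LEM baseline on the training topics (+0.6% MAP on TDN).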
In order to come up with a reasonably good baseline, we tested several approaches to building a Boolean pre-filter from a given topic (the figures below are the mean average precision (MAP) on the T queries):
  • OR (25.5%): the logical OR of the terms (or lemmas)
  • AND (15.8%): the logical AND of the terms
  • NEAR (15.2%): the logical OR of all the pairs (ti NEAR tj), where ti and tj are query terms
  • AND-1 (23.6%): the logical OR of all the possible conjunctions of terms, except for the conjunction of all the terms.
The best results in terms of MAP were produced by the OR filtering, followed by the computation of a relevance score based on the Okapi BM25 weighting model (with default parameters). The test was carried out on the titles (T) of the 150 training topics. More restrictive filtering schemes were tried out but did not perform any better, probably because of the relatively small size of the corpus.

Runs with word senses

For the disambiguation-based runs we tried out several other filtering schemes, including:
  • OR (22.4%): the logical OR of the best synsets corresponding to the topic terms
  • AND (15.1%): the logical AND of the best synsets corresponding to the topic terms
  • NEAR (12.5%): the logical OR of all the pairs (si NEAR sj), where si and sj are the best synsets corresponding to the topic terms ti and tj
  • AND-1 (18.8%): the logical OR of all the possible conjunctions of synsets, except for the conjunction of all the synsets
  • HYPER (14.3%): the logical AND of the clauses (si OR hi), where si is the best synset corresponding to a topic term ti and hi is the direct hypernym of si in WordNet
  • ORHYPER (18.43%): the logical OR of the clauses (si OR hi), where si is the best synset corresponding to a topic term ti and hi is the direct hypernym of si in WordNet.
However, none of these strategies performed better than the basic OR filter on terms (a sketch of these pre-filters is given after the result tables below).

Result summary

The first table below shows the mean average precision (in percent) calculated on the training queries with different query processing options (disambiguation algorithm and part-of-topic selection) and different document processing options (disambiguation and translation). Of course, the processing (NAT, LEM, WSD or WSDL) was always the same throughout the queries and the corpus for a given run. The baseline was the run with topic selection TDN and term selection LEM (i.e. the whole topic with stemming). The second table shows the results on the testing queries, which are slightly better than those on the training queries (maybe the testing queries were somewhat easier). The baseline (LEM) for the Spanish queries was created by automatically translating the queries from Spanish into English. The tests on the NUS corpus produced better results than those on the UBC one. Therefore most of the runs were performed on the NUS corpus, while the UBC corpus was used to test the interoperability of the disambiguation processes.

Average precision on TRAINING requests (OR strategy):

  Query                      Document processing
  processing         Base Line          NUS                  UBC
                     NAT      LEM       WSD      WSD+LEM     WSD
  NONE       T       25.2%    27.0%     -        -           -
             TDN     -        31.9%     -        -           -
  NUS        T       -        -         22.4%    26.0%       -
             TDN     -        -         28.8%    32.5%       24.9%
  UBC        T       -        -         -        -           22.6%
             TDN     -        -         22.4%    -           25.4%
  ESP        T       -        -         4.0%     -           -
             TDN     -        -         6.6%     -           6.2%

Average precision on TESTING requests (OR strategy):

  Query                      Document processing
  processing         Base Line    NUS                  UBC
                     LEM          WSD      WSD+LEM     WSD
  NONE       T       30.64%       -        -           -
             TD      36.64%       -        -           -
             TDN     39.17%       -        -           -
  NUS        T       -            21.20%   -           -
             TD      -            29.34%   -           -
             TDN     -            32.69%   38.14%      -
  UBC        TDN     -            -        -           29.62%
  Trans. ESP T       30.36%       -        -           -
  FSH-ESP    T       -            8.46%    -           -
             TDN     -            9.70%    -           -
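To make the pre-filter definitions above concrete, here is a minimal sketch that builds the four basic Boolean filters from a list of query tokens (terms, lemmas or best synsets). The infix OR/AND/NEAR strings are an assumed notation for illustration; the actual IDX-VLI query syntax may differ. The documents passing the filter are then ranked with Okapi BM25, whose commonly used default parameters are around k1 = 1.2 and b = 0.75.

```python
from itertools import combinations

def or_filter(tokens):
    # OR: at least one query token must occur
    return " OR ".join(tokens)

def and_filter(tokens):
    # AND: every query token must occur
    return " AND ".join(tokens)

def near_filter(tokens):
    # NEAR: OR over all pairs (ti NEAR tj)
    return " OR ".join(f"({a} NEAR {b})" for a, b in combinations(tokens, 2))

def and_minus_one_filter(tokens):
    # AND-1: OR over the possible conjunctions of tokens, excluding the
    # conjunction of all of them (read here as subsets of size 2..n-1;
    # single tokens are not counted as conjunctions in this sketch).
    n = len(tokens)
    clauses = ["(" + " AND ".join(sub) + ")"
               for size in range(2, n)
               for sub in combinations(tokens, size)]
    return " OR ".join(clauses)

query_tokens = ["teenage", "suicide", "prevention"]  # toy example
for build in (or_filter, and_filter, near_filter, and_minus_one_filter):
    print(build.__name__, "->", build(query_tokens))
```

The same constructions were applied to the best synsets in the sense-based runs; in both cases the plain OR filter followed by BM25 ranking gave the highest MAP.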
Findings and Discussion

In the tables above we note the following facts:
  • Using the D and N topic parts increases the precision in all cases (with and without WSD). This is probably due to the ranking method, which benefits from the additional terms provided by D and N.
  • On the training runs with NUS disambiguation, the senses alone (WSD) decrease the MAP with respect to the LEM baseline: -4.6% on T queries and -3.1% on TDN. Adding the lemmas to the senses (WSDL) slightly improves the MAP (+0.6%). This is the only case where disambiguation brings any improvement.
  • Using different disambiguation algorithms for the queries and the documents noticeably decreases the results. This should not happen if the algorithms were perfect. It shows that disambiguation acts as a kind of encoding process on the words, and obviously the best results are obtained when the same encoding, producing the same mistakes, is applied to both queries and documents. Thus, at this stage, the disambiguation algorithms are not interoperable.

We carefully analyzed about 50 queries to better understand what happened during the disambiguation process. For instance, the query with the title "El Niño and the weather" was disambiguated as follows (NUS):
  • El was understood as the abbreviation el. of elevation
  • Niño was understood as the abbreviation Ni of nickel, probably because the parser failed on the non-ASCII character ñ
  • weather was correctly understood as the weather concept.
Although the disambiguation was incorrect, WSD performed as well as LEM because the "encoding" was the same in the collection and in the query, and there were few or no documents about nickel that could have brought up noise. More generally, when the WSD results were better than the LEM ones, this was due to contingencies rather than to semantic processing. For instance, the query title "Teenage Suicides" obtained a better score with WSD because teenage was not recognized. The query thus became suicides, which is narrower than teenage OR suicide and, on this corpus, avoids retrieving a large number of irrelevant documents about teenagers. A few items of the test run are commented in Appendix A.

The poor performance on Spanish queries is due to 1) the above-mentioned lack of interoperability between the different WSD algorithms, and 2) the low quality of the Spanish WSD itself. This can be illustrated with some examples. In Question 41, "Pesticide in baby food" is translated as "Pesticidas en alimentos para bebes" and is then converted into the FOOD and DRINK (verb) concepts, because bebes is a conjugated form of beber, the Spanish verb for drink. In Question 43, "El Niño and the weather" is translated as "El Niño y el tiempo" and is then converted into the CHILD and TIME concepts, because Niño is the Spanish noun for child and tiempo is an ambiguous word meaning both time and weather. Given those difficulties, outstanding results could not be expected.

Looking back at the questions and results, it can be noted that 1'793 documents were retrieved out of the 2'052 relevant ones, i.e. almost 90% of them. The core issue is to sort the documents so as to reject those whose content does not match the users' expectations. A closer look at our results on the training corpus showed that we obtained a pretty good performance on some of the requests.
This does not mean that our search engine understood the said requests correctly; it is simply due to the fact that the corpus included only good matches for those requests, so it was almost impossible to find wrong answers. For instance, on Question 50 about the "Revolt in Chiapas", we retrieved 106 documents out of the 107 relevant ones with an average precision of 87%. This is because, in the corpus, Chiapas is only known for its revolt (in fact, if we google the word "Chiapas", a good proportion of the current results are about the Chiapas rebellion). On the other hand, on Question 59, "Computer Viruses", our search engine retrieved the single relevant document (1 out of 1) with an average precision of only 0.3%: the 300 documents ranked before the one we were looking for were indeed about viruses and computers, but did not mention any virus name or damage as was requested (with a single relevant document, the average precision is simply the reciprocal of its rank, here about 1/300 ≈ 0.3%). Therefore term disambiguation does not help the search engine understand what kind of documents are expected. A question such as the one above requires the text to be read and understood in order to decide whether it is actually a correct match.

Conclusion

Intuitively, Word Sense Disambiguation should improve the quality of information retrieval systems. However, as already observed in previous experiments, this is only true in some specific situations, for instance when the disambiguation process is almost perfect, or in limited domains. The observations presented here seem to support this statement. We propose two types of explanations:

1. When a query is large enough (more than one or two words), the probability that a document containing these words uses them with a meaning different from the intended one is very low. For instance, it is unlikely that a document containing mouse, cheese and cat is in fact about a computer mouse. This probably makes WSD useless in many situations. Such a request is similar in nature to the narrative-based tests. On the other hand, the WSD approach could make more sense when requests include only one or two words (which is the most frequent case in standard searches).

2. WSD is a very partial semantic analysis which is insufficient to really understand the queries. For instance, consider the query "Computer Viruses", whose narrative is "Relevant documents should mention the name of the computer virus, and possibly the damage it does". To find relevant documents, a system must recognize phrases which contain virus names ("the XX virus", "the virus named XX", "the virus known as XX", etc.). It should also recognize phrases describing damage ("XX erases the hard disk", "XX causes system crashes", but not "XX propagates through mail messages"). These tasks are very difficult to perform and they are far beyond the scope of WSD. Moreover, they require specific domain knowledge, as shown in [rf06].

The modifications made to our stop-word lists showed that our search engine is more sensitive to various adjustments of its internal parameters than to the use of a WSD system. Indeed, when we ran a new series of tests with English-only stop words (which eliminated some terms in the requests, such as "eu" and "un"), our score for the LEM-TDN run (our best result in this task) increased from 39.17% to 39.63%. Finally, as we argued in [grf05], conceptual indexing is a promising approach for language-independent indexing and retrieval systems.
Although an efficient WSD is essential to create good conceptual indexes, we showed in [grf05] that ambiguous indexes (with several concepts for some terms) are often sufficient to reach good multilingual retrieval performance, for the reasons mentioned above.
